To begin, I wanted to explore a summary of the dataset's features:
## 'data.frame': 4898 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ score : Ord.factor w/ 9 levels "1"<"2"<"3"<"4"<..: 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality score
## Min. : 8.00 Min. :3.000 6 :2198
## 1st Qu.: 9.50 1st Qu.:5.000 5 :1457
## Median :10.40 Median :6.000 7 : 880
## Mean :10.51 Mean :5.878 8 : 175
## 3rd Qu.:11.40 3rd Qu.:6.000 4 : 163
## Max. :14.20 Max. :9.000 3 : 20
## (Other): 5
Let’s plot the distributions of the other features in the dataset:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
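The binwidth message above appears whenever a histogram is drawn without an explicit bin size. A minimal sketch of setting it by hand, assuming ggplot2 is installed; the data frame `first_ten` and the bin width of 0.1 are illustrative choices, not the settings used for the plots in this report:

```r
library(ggplot2)

# First ten alcohol values, copied from the str() output above.
first_ten <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# An explicit binwidth (0.1 is an arbitrary choice for illustration)
# replaces the range/30 default, so no message is emitted.
p <- ggplot(first_ten, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "alcohol", y = "count")
p
```

A sensible bin width depends on the variable's scale: 0.1 suits alcohol, while density would need something closer to 0.0005.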
The data set has 4898 observations of 13 variables (14 counting the derived score factor):
$ X : int -> sequential row number
$ fixed.acidity : num 3.8 - 14.2
$ volatile.acidity : num 0.08 - 1.1
$ citric.acid : num 0.00 - 1.66
$ residual.sugar : num 0.6 - 65.8
$ chlorides : num 0.009 - 0.346
$ free.sulfur.dioxide : num 2.0 - 289.0
$ total.sulfur.dioxide: num 9.0 - 440.0
$ density : num 0.987 - 1.039
$ pH : num 2.72 - 3.82
$ sulphates : num 0.22 - 1.08
$ alcohol : num 8.0 - 14.2
$ quality : int 3 - 9
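The minimum and maximum of every column can be pulled out in one call instead of being read off the summary by hand. A small sketch in base R; `first_ten` is a hypothetical name and holds only the first ten rows shown in the str() output, so its ranges are narrower than those of the full data set:

```r
# First ten rows of three columns, copied from the str() output above.
first_ten <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# range() returns c(min, max); sapply() applies it column by column
# and binds the results into a 2 x 3 matrix.
ranges <- sapply(first_ten, range)
rownames(ranges) <- c("min", "max")
print(ranges)
```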
All variables look approximately Gaussian, except for residual sugar and alcohol. Alcohol is more widely spread, almost linearly between 9.9 and 12.
Quality is an integer, but it can be treated as an ordered factor, so I created the “score” ordered factor from the quality values.
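The conversion can be sketched as follows. The `quality` vector here is a hypothetical example, and `levels = 1:9` is an assumption inferred from the str() output, which reports nine levels starting at "1":

```r
# Hypothetical quality ratings (not taken from the dataset).
quality <- c(6L, 5L, 7L, 8L, 4L, 3L, 9L, 6L)

# ordered = TRUE makes the levels comparable with <, >, etc.
# levels = 1:9 is assumed from the "9 levels" shown by str().
score <- factor(quality, levels = 1:9, ordered = TRUE)

print(score)
print(score[3] > score[2])  # comparisons are allowed on ordered factors
print(table(score))         # unused levels (1 and 2) appear with count 0
```

Declaring all nine levels up front keeps the scale complete even though no wine in the data scores below 3.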
The main features I was interested in were quality and alcohol.
All the features are interesting. I suppose that wines with low acidity, chlorides and sulphates will score better than other wines.
Yes, I created a “score” variable, which is an ordered factor version of “quality”.
The alcohol distribution was unusual: it was not Gaussian. The dataset was already in tidy format, so I did not have to make adjustments. As described above, I transformed the “quality” integer variable into an ordered factor called “score”.
There are too many variables, so let’s select the most interesting ones.
There is a high correlation between density and residual sugar.
There is also a high correlation between density and alcohol.
But there is no high correlation between alcohol and residual sugar.
Alcohol seems to be the only variable strongly linked to quality,
while the other variables, according to the correlation coefficients in the ggpairs plot, have a lower impact on quality.
Quality is most strongly correlated with alcohol (0.464), followed by weaker negative correlations with density (-0.328), chlorides (-0.22), volatile acidity (-0.168) and total sulfur dioxide (-0.157).
There is an evident correlation between density and residual sugar (0.828). This is due to fermentation, which converts sugar (dense) into alcohol (less dense). The negative correlation between residual sugar and alcohol (-0.435) is consistent with this.
The strongest relationship I found is between density and residual sugar (0.828). It can be explained by the natural winemaking process, in which sugar is converted into alcohol.
I also found a fairly strong relationship between alcohol and wine quality (0.464).
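Coefficients like the ones quoted above can be computed directly with `cor()`. A minimal sketch, assuming Pearson correlation; `first_five` is a hypothetical name and holds only the first five rows printed by str() (rounded as displayed), far too few to reproduce the reported values, so only the signs are meaningful:

```r
# First five rows of three columns, copied from the str() output.
first_five <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pairwise Pearson correlations between all columns.
cm <- cor(first_five, method = "pearson")
print(round(cm, 2))
```

Even on this tiny sample, density moves with residual sugar and against alcohol, which matches the direction of the relationships described above.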
Let’s have a closer look at alcohol by density and alcohol by residual sugar, colored by quality score.
As alcohol increases, we get more high-quality wines in both plots. In the first one, we can also see that density decreases as the alcohol concentration increases.
Coloring the scatter plot of density by residual sugar shows that better wines have higher residual sugar.
Low density and low volatile acidity both have an impact on wine quality, but there is no particular pattern correlating the two factors.
Wines with a score of 5 or lower are concentrated at lower alcohol percentages.
Let’s create a linear model to see if we can predict quality based on the main correlated features.
##
## Calls:
## m1: lm(formula = (quality ~ alcohol), data = wines)
## m2: lm(formula = quality ~ alcohol + density, data = wines)
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = wines)
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity,
## data = wines)
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides, data = wines)
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides + total.sulfur.dioxide, data = wines)
##
## ====================================================================================================
##                                  m1           m2           m3           m4           m5           m6
## ----------------------------------------------------------------------------------------------------
## (Intercept)                2.582***   -22.492***    90.313***    74.225***    73.271***    81.344***
##                             (0.098)      (6.165)     (12.374)     (11.977)     (11.999)     (12.246)
## alcohol                    0.313***     0.360***     0.246***     0.286***     0.283***     0.284***
##                             (0.009)      (0.015)      (0.018)      (0.018)      (0.018)      (0.018)
## density                                24.728***   -87.886***   -71.546***   -70.514***   -78.777***
##                                          (6.079)     (12.317)     (11.923)     (11.949)     (12.209)
## residual.sugar                                       0.053***     0.052***     0.052***     0.053***
##                                                       (0.005)      (0.005)      (0.005)      (0.005)
## volatile.acidity                                                 -2.059***    -2.044***    -2.077***
##                                                                    (0.109)      (0.110)      (0.110)
## chlorides                                                                        -0.692       -0.769
##                                                                                 (0.540)      (0.540)
## total.sulfur.dioxide                                                                         0.001**
##                                                                                              (0.000)
## ----------------------------------------------------------------------------------------------------
## R-squared                     0.190        0.192        0.210        0.264        0.264        0.266
## adj. R-squared                0.190        0.192        0.210        0.263        0.263        0.265
## sigma                         0.797        0.796        0.787        0.760        0.760        0.759
## F                          1146.395      583.290      434.085      438.646      351.293      295.042
## p                             0.000        0.000        0.000        0.000        0.000        0.000
## Log-likelihood            -5839.391    -5831.127    -5776.812    -5604.126    -5603.301    -5598.094
## Deviance                   3112.257     3101.773     3033.737     2827.187     2826.235     2820.233
## AIC                       11684.782    11670.255    11563.624    11220.251    11220.603    11212.189
## BIC                       11704.272    11696.241    11596.107    11259.231    11266.079    11264.161
## N                              4898         4898         4898         4898         4898         4898
## ====================================================================================================
Most features slightly improve the fit of the model (chlorides is the exception: its coefficient is not significant and R-squared stays at 0.264), but the overall result is not satisfactory. An R-squared of 0.266 is very low.
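Adding predictors one at a time and comparing fit statistics can be sketched in base R as below. The data are simulated with arbitrary coefficients purely to make the example self-contained; `sim`, its column values and the resulting numbers are assumptions and do not reproduce the table above, which was presumably printed by a comparison helper such as `mtable`:

```r
set.seed(1)
n <- 200

# Simulated stand-in for the wines data (NOT the real dataset).
sim <- data.frame(
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.0)
)
sim$quality <- 3 + 0.3 * sim$alcohol - 2 * sim$volatile.acidity + rnorm(n)

# Nested models: m2 adds one predictor to m1.
m1 <- lm(quality ~ alcohol, data = sim)
m2 <- update(m1, . ~ . + volatile.acidity)

# R-squared can never decrease when a predictor is added,
# so AIC or an F-test is a fairer judge of the extra term.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))
print(AIC(m1, m2))
print(anova(m1, m2))
```

This is the pattern visible between m4 and m5 in the table: adding chlorides leaves R-squared at 0.264 while AIC rises slightly from 11220.251 to 11220.603, so the extra term does not pay for itself.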
There is a good correlation between density, residual sugar and alcohol.
##
## Calls:
## m10: lm(formula = (density ~ residual.sugar), data = wines)
## m11: lm(formula = density ~ residual.sugar + alcohol, data = wines)
##
## ==========================================
##                           m10          m11
## ------------------------------------------
## (Intercept)          0.991***     1.005***
##                       (0.000)      (0.000)
## residual.sugar       0.000***     0.000***
##                       (0.000)      (0.000)
## alcohol                          -0.001***
##                                    (0.000)
## ------------------------------------------
## R-squared               0.704        0.907
## adj. R-squared          0.704        0.907
## sigma                   0.002        0.001
## F                   11636.984    23791.076
## p                       0.000        0.000
## Log-likelihood      24498.873    27328.019
## Deviance                0.013        0.004
## AIC                -48991.747   -54648.037
## BIC                -48972.257   -54622.051
## N                        4898         4898
## ==========================================
In fact, this model is much better. Alcohol concentration and residual sugar are the main factors determining density.
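The residual sugar coefficient prints as 0.000 in the table only because the output is rounded to three decimals, while density itself varies in the third and fourth decimal place. A sketch of how to see the slopes at full precision, using simulated data; the coefficients 0.0004 and -0.0012 are arbitrary values chosen to be on a plausible scale, not estimates from the wines data:

```r
set.seed(42)
n <- 200

# Simulated stand-in for the wines data (NOT the real dataset).
sim <- data.frame(
  residual.sugar = runif(n, 1, 20),
  alcohol        = runif(n, 8, 14)
)
# Arbitrary small slopes so that density lands close to 1.
sim$density <- 1.005 + 0.0004 * sim$residual.sugar -
  0.0012 * sim$alcohol + rnorm(n, sd = 0.0005)

fit <- lm(density ~ residual.sugar + alcohol, data = sim)

print(round(coef(fit), 3))   # three decimals hide the sugar slope
print(signif(coef(fit), 3))  # three significant digits keep it visible
```

Reporting significant digits, or rescaling density (for example to g/L), avoids a table full of zeros.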
Yes, in general wines with lower density tend to have higher quality, while residual sugar does not seem to have a clear impact on the quality. Combining residual sugar and density, we can see that for a given density, wines with higher residual sugar have higher quality.
It was interesting to see how density is correlated with sugar and alcohol content. The longer the fermentation lasts, the lower the residual sugar and the higher the alcohol percentage. The final residual sugar and alcohol percentage are the main factors in the measured density.
I created two models for the sample.
The first one predicts the quality of the wine based on the dataset features. This model was very weak, with an R-squared value of 0.266. It suggests that it is really hard to predict the quality of a wine from objective measurements of its chemical components.
The second model predicts wine density based on residual sugar and alcohol. This model was quite accurate, with an R-squared value of 0.907.
The first plot shows the quality distribution of the wines in the dataset. The dataset contains wines that scored from 3 to 9, in a distribution close to binomial.
There is a tendency for better wines (scoring 7 or above) to have a higher alcohol concentration. This almost linear relationship between score and alcohol concentration only holds between the scores of 5 and 9 (inclusive); for scores lower than 5 the tendency reverses. This reversal makes the relationship non-monotonic, so the score cannot be read back unambiguously from the alcohol percentage, which makes it difficult to predict with a model.
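A quick way to check for such a reversal is to compute the median alcohol per score and look at the sign of the differences between consecutive scores. The values in `toy` below are made up to mimic the pattern described above; they are not taken from the dataset:

```r
# Hypothetical data shaped like the described pattern (NOT real wines).
toy <- data.frame(
  quality = c(3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8),
  alcohol = c(10.4, 10.6, 10.0, 10.2, 9.4, 9.6,
              10.4, 10.6, 11.3, 11.5, 11.9, 12.1)
)

# Median alcohol within each quality score.
med <- tapply(toy$alcohol, toy$quality, median)
print(med)

# Differences between consecutive scores: mixed signs mean the
# relationship is not monotonic.
steps <- diff(med)
print(steps)
print(any(steps < 0) && any(steps > 0))
```

When both signs appear, a single straight line through score versus alcohol will necessarily misfit one end of the scale.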
## Warning: Removed 3 rows containing missing values (geom_point).
The plot shows how very good wines tend to have lower density and higher residual sugar. This is consistent with the previous plot: a wine with both high residual sugar and low density must have a high percentage of alcohol.
The wines dataset shows that wine quality as judged by humans is far more complex than the objective chemical parameters recorded in the data set. It is not possible to judge wine quality on these parameters alone, but some features do have an impact on the perceived quality. In general we tend to prefer wines with a high alcohol percentage, while factors like chlorides, volatile acidity and total sulfur dioxide have a negative impact on wine taste.
The dataset was tidy and clean, so I was able to dig directly into the analysis. The ggpairs plot was very useful for spotting possible correlations between variables and gave me several insights. I struggled somewhat to find the ggpairs documentation and to format the plot for the knitted file.
It would certainly be interesting to analyse the geographical position (and altitude) and the production year. I think these features could be significant in determining wine quality, because altitude and weather can affect the sugar content before fermentation, which would lead to a higher final alcohol volume and residual sugar.